Generating Images From Spoken Descriptions

Authors

Abstract

Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form; consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates spoken descriptions into photo-realistic images, 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of the corresponding visual information of the images. The generative model synthesizes images, conditioned on the speech embeddings produced by the embedding network, that are semantically consistent with the spoken descriptions. Extensive experiments were conducted on four public benchmark databases: two databases commonly used in text-to-image generation tasks, CUB-200 and Oxford-102, for which we created synthesized spoken descriptions, and two databases with natural speech that are often used in the field of cross-modal learning, Flickr8k and Places. Results demonstrate the effectiveness of S2IGAN in synthesizing high-quality, semantically-consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
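
To make the pipeline concrete, here is a minimal PyTorch sketch of the general speech-to-image setup described above: a speech encoder pools an utterance into a fixed embedding, and a conditional generator maps that embedding plus noise to an image. This is an illustrative sketch only, not the authors' S2IGAN implementation; all module names, layer sizes, and the mean-pooling choice are assumptions.

```python
# Minimal speech-to-image sketch: speech encoder + conditional generator.
# All names, dimensions, and pooling choices are illustrative assumptions,
# not the S2IGAN architecture from the paper.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encode a log-mel spectrogram (batch, time, n_mels) into a fixed embedding."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, mels):
        out, _ = self.rnn(mels)            # (batch, time, 2*embed_dim)
        return self.proj(out.mean(dim=1))  # temporal mean pooling -> (batch, embed_dim)

class ConditionalGenerator(nn.Module):
    """Map [noise; speech embedding] to a low-resolution RGB image."""
    def __init__(self, noise_dim=100, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(noise_dim + embed_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z, speech_emb):
        h = self.fc(torch.cat([z, speech_emb], dim=1)).view(-1, 128, 8, 8)
        return self.deconv(h)              # (batch, 3, 32, 32)

# Usage: one forward pass with dummy data.
enc, gen = SpeechEncoder(), ConditionalGenerator()
mels = torch.randn(4, 200, 40)             # 4 utterances, 200 frames, 40 mel bins
z = torch.randn(4, 100)
images = gen(z, enc(mels))
print(images.shape)                        # torch.Size([4, 3, 32, 32])
```

In S2IGAN the embedding network is additionally supervised by the paired images, and the generative model is densely stacked with relation supervision; neither refinement is shown in this sketch.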


Similar Articles

Generating Tailored, Comparative Descriptions in Spoken Dialogue

We describe an approach to presenting information in spoken dialogues that for the first time brings together multi-attribute decision models, strategic content planning, state-of-the-art dialogue management, and realization which incorporates prosodic features. The system selects the most important subset of available options to mention and the attributes that are most relevant to choosing bet...
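
As a toy illustration of the multi-attribute decision model driving that content selection, the Python sketch below scores options with a weighted additive utility and keeps only the best subset for mention; the options, attributes, and weights are invented for illustration.

```python
# Rank options by a weighted additive utility, then mention only the
# top-scoring subset in the dialogue turn. Attributes and weights here
# are invented placeholders for a user model.
options = {
    "flight_A": {"price": 0.9, "duration": 0.4, "layovers": 1.0},
    "flight_B": {"price": 0.5, "duration": 0.9, "layovers": 0.5},
    "flight_C": {"price": 0.7, "duration": 0.6, "layovers": 1.0},
}
weights = {"price": 0.5, "duration": 0.3, "layovers": 0.2}  # user preferences

def utility(attrs):
    return sum(weights[a] * v for a, v in attrs.items())

ranked = sorted(options, key=lambda o: utility(options[o]), reverse=True)
print(ranked[:2])  # the two best options selected for mention
```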

Midge: Generating Descriptions of Images

We demonstrate a novel, robust vision-to-language generation system called Midge. Midge is a prototype system that connects computer vision to syntactic structures with semantic constraints, allowing for the automatic generation of detailed image descriptions. We explain how to connect vision detections to trees in Penn Treebank syntax, which provides the scaffolding necessary to further refine ...
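
A rough sketch of that detection-to-syntax step, using NLTK trees in place of Midge's actual Penn Treebank templates; the grammar, verb choice, and detection list are invented assumptions.

```python
# Toy version of the core idea: turn object detections into a syntactic
# tree that can be realized as a description. This grammar is invented,
# not Midge's template inventory.
from nltk import Tree

detections = [("dog", 0.97), ("frisbee", 0.91)]           # (label, confidence)
np1 = Tree("NP", [Tree("DT", ["a"]), Tree("NN", [detections[0][0]])])
np2 = Tree("NP", [Tree("DT", ["a"]), Tree("NN", [detections[1][0]])])
vp = Tree("VP", [Tree("VBG", ["catching"]), np2])
sent = Tree("S", [np1, vp])
print(" ".join(sent.leaves()))                             # "a dog catching a frisbee"
```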

Image2speech: Automatically generating audio descriptions of images

This paper proposes a new task for artificial intelligence. The image2speech task generates a spoken description of an image. We present baseline experiments in which the neural net used is a sequence-to-sequence model with attention, and the speech synthesizer is clustergen. Speech is generated from four different types of segmentations: two that require a language with known orthography (word...
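
As a sketch of the attention mechanism such a sequence-to-sequence model relies on, the snippet below computes one dot-product attention step over encoder features; the shapes and the scoring function are illustrative assumptions, not details from the paper.

```python
# One attention step in a sequence-to-sequence decoder: the decoder state
# attends over encoder features (e.g., image-region features) to form a
# context vector. Shapes and the dot-product score are assumptions.
import torch
import torch.nn.functional as F

def attention_step(decoder_state, encoder_feats):
    """decoder_state: (batch, d); encoder_feats: (batch, n_regions, d)."""
    scores = torch.bmm(encoder_feats, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, n_regions)
    weights = F.softmax(scores, dim=1)                                        # attention distribution
    context = torch.bmm(weights.unsqueeze(1), encoder_feats).squeeze(1)       # (batch, d)
    return context, weights

state = torch.randn(2, 256)
feats = torch.randn(2, 49, 256)       # e.g., a 7x7 grid of image features
context, weights = attention_step(state, feats)
print(context.shape, weights.shape)   # torch.Size([2, 256]) torch.Size([2, 49])
```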

Generating Descriptions of Spatial Relations between Objects in Images

We investigate the task of predicting prepositions that can be used to describe the spatial relationships between pairs of objects depicted in images. We explore the extent to which such spatial prepositions can be predicted from (a) language information, (b) visual information, and (c) combinations of the two. In this paper we describe the dataset of object pairs and prepositions we have creat...
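
A minimal sketch of the (a)/(b)/(c) comparison described above, framed as multiclass classification over object-pair features; the feature dimensions, classifier choice, and random data are placeholders for illustration.

```python
# Predict a spatial preposition for an object pair from language features,
# visual features, or their concatenation. All features and labels here are
# random placeholders standing in for the paper's dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
lang = rng.normal(size=(200, 50))     # language features per object pair
vis = rng.normal(size=(200, 8))       # visual features (box geometry, overlap, ...)
y = rng.integers(0, 5, size=200)      # one of 5 candidate prepositions

for name, X in [("language", lang), ("visual", vis),
                ("combined", np.hstack([lang, vis]))]:
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(name, clf.score(X, y))      # training accuracy per feature set
```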

Generating Auction Configurations from Declarative Contract Descriptions

This work presents an approach to automating the negotiation of business contracts and describes an implementation of a subset of this overall goal. To support automated contract negotiation, we are developing a language for both (1.) fully-specified, executable contracts and (2.) partially-specified contracts that are in the midst of being negotiated, specifically via automated auctions. The langu...
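
A toy Python sketch of the distinction between (1.) fully-specified, executable contracts and (2.) partially-specified contracts still under negotiation; the contract fields and the completeness check are invented for illustration.

```python
# A contract whose fields may be left open (None) while under negotiation,
# plus a check for whether it is fully specified and thus executable.
# Field names are invented, not taken from the paper's language.
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Contract:
    buyer: Optional[str] = None
    seller: Optional[str] = None
    price: Optional[float] = None      # open until the auction settles it
    quantity: Optional[int] = None

    def fully_specified(self) -> bool:
        return all(getattr(self, f.name) is not None for f in fields(self))

draft = Contract(buyer="ACME", quantity=100)   # partially specified, negotiable
print(draft.fully_specified())                 # False until price and seller are set
```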


Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2021

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2021.3053391